Morphology-Based Language Modeling for Amharic: Dissertation for the Attainment of the Degree

Authors

  • Martha Yifiru Tachbelie
  • Christopher Habel
  • Solomon Teferra Abate
  • Deborah Solomon
Abstract

Language models are fundamental to many natural language processing applications. The most widely used type of language model is the corpus-based probabilistic one. These models estimate the probability of a word sequence W from training data, so large amounts of training data are required to ensure statistical significance. But even when the training data are very large, the problems of data sparseness and out-of-vocabulary (OOV) words cannot be avoided. These problems are particularly serious for languages with a rich morphology, which are characterized by a high vocabulary growth rate and a correspondingly high perplexity of their language models. Since the vocabulary size directly affects system complexity, a promising direction is the use of sub-word units in language modeling. This study explored different ways of language modeling for Amharic, a morphologically rich Semitic language, using morphemes as units. Morpheme-based language models were trained on automatically and manually segmented data using the SRI Language Modeling toolkit (SRILM). The quality of these models was assessed in terms of perplexity, the probability they assign to the test set, and the improvement in word recognition accuracy obtained by using them in a speech recognition task. The results show that the morpheme-based language models trained on manually segmented data always have higher quality. A comparison with word-based models reveals that the word-based models fared better in terms of the probability they assigned to the test set. In terms of word recognition accuracy, however, interpolated (morpheme- and word-based) models achieved the best results. In addition, the morpheme-based models reduced the OOV rate considerably.
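The three quantities used in this evaluation (OOV rate, perplexity, and interpolation of two models) can be illustrated with a minimal Python sketch. It is not the SRILM setup used in the study: the toy English tokens, the `+` morpheme marking, the add-one smoothed unigram model, and the function names `oov_rate`, `unigram_perplexity`, and `interpolate` are all assumptions made for illustration only.

```python
import math
from collections import Counter


def oov_rate(train_tokens, test_tokens):
    """Fraction of test tokens that never occur in the training data."""
    vocab = set(train_tokens)
    unseen = sum(1 for t in test_tokens if t not in vocab)
    return unseen / len(test_tokens)


def unigram_perplexity(train_tokens, test_tokens):
    """Perplexity of an add-one smoothed unigram model (illustration only)."""
    counts = Counter(train_tokens)
    vocab_size = len(counts) + 1  # one extra slot reserved for unseen tokens
    total = len(train_tokens)
    log_prob = 0.0
    for t in test_tokens:
        p = (counts[t] + 1) / (total + vocab_size)
        log_prob += math.log2(p)
    return 2 ** (-log_prob / len(test_tokens))


def interpolate(p_word, p_morph, lam=0.5):
    """Linear interpolation of two model probabilities with weight lam."""
    return lam * p_word + (1 - lam) * p_morph


# Hypothetical toy data: the same text as whole words and as morphemes.
train_words = ["walked", "talks", "walking"]
test_words = ["talked", "walks"]
train_morphs = ["walk", "+ed", "talk", "+s", "walk", "+ing"]
test_morphs = ["talk", "+ed", "walk", "+s"]

word_oov = oov_rate(train_words, test_words)      # every test word is unseen
morph_oov = oov_rate(train_morphs, test_morphs)   # every test morpheme was seen
morph_ppl = unigram_perplexity(train_morphs, test_morphs)
```

In this toy setting the word-level OOV rate is 1.0 while the morpheme-level OOV rate is 0.0, which mirrors the effect described above in miniature. Perplexities computed over different units (words versus morphemes) are normalized by different token counts, so they are not directly comparable; this is one reason to also compare the total probability assigned to the test set.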
Since using morpheme-based language models in a lattice rescoring framework does not solve the OOV problem, speech recognition experiments were conducted in which morphemes serve as both dictionary entries and language modeling units. The use of morphemes greatly reduced the OOV rate and consequently boosted the word recognition accuracy of the 5k vocabulary morpheme-based speech recognition system. However, because morpheme-based recognition systems suffer from acoustic confusability and limited n-gram language model scope, their performance with a larger morph vocabulary was not as expected. When morphemes are used as units in language modeling, word-level dependencies might be lost. As a solution to this problem, we investigated root-based language models in the framework of factored language modeling. Although this produced far better test set probabilities, the much weaker predictions of a root-

Related resources

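Analyse des performances de modèles de langage sub-lexicale pour des langues peu-dotées à morphologie riche (Performance analysis of sub-word language modeling for under-resourced languages with rich morphology: case study on Swahili and Amharic) [in French]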

Performance analysis of sub-word language modeling for under-resourced languages with rich morphology: case study on Swahili and Amharic. This paper investigates the impact on ASR performance of sub-word units for two under-resourced African languages with rich morphology (Amharic and Swahili). Two sub-word units are considered: syllable and morpheme, the latter being obtained in a supervised or...

Distributed Collaborative Augmented Reality

Carried out for the purpose of obtaining the academic degree of Doctor of Technical Sciences under the supervision of A.o. Prof. Dipl.-Ing. Dr. Michael Gervautz, Institut Nr. 186, Institut für Computergraphik, with Univ.-Ass. Dipl.-Ing. Dr.techn. Dieter Schmalstieg as supervising assistant; submitted to the Technische Universität Wien, Fakultät für Technische Naturwissenschaften und ...

Collision detection and post-processing for physical cloth simulation

Dissertation of the Fakultät für Informations- und Kognitionswissenschaften of the Eberhard-Karls-Universität Tübingen for the degree of Doctor of Natural Sciences. ... the modeling software Alias Maya was further developed in order to provide an efficient and convenient test and visualization environment.

Publication date: 2010